Partitioning-based clustering for Web document categorization

Authors

  • Daniel Boley
  • Maria L. Gini
  • Robert Gross
  • Eui-Hong Han
  • Kyle Hastings
  • George Karypis
  • Vipin Kumar
  • Bamshad Mobasher
  • Jerome Moore
Abstract

Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.

Similar resources

Web Document Clustering Using Threshold Selection Partitioning

Clustering techniques have been applied to categorize documents on World Wide Web. In previous research, PDDP (Principal Direction Divisive Partitioning) is a well-known clustering algorithm. PDDP algorithm employs top-down and unsupervised clustering based on the principal component analysis and splits documents into two sets using a plane perpendicular to the maximum principal direction passi...

Analysis of Clustering Algorithms for Web-Based Search

Automatic document categorization plays a key role in the development of future interfaces for Web-based search. Clustering algorithms are considered as a technology that is capable of mastering this “ad-hoc” categorization task. This paper presents results of a comprehensive analysis of clustering algorithms in connection with document categorization. The contributions relate to exemplar-based...

Impact of Similarity Measures on Web-page Clustering

Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possi...

Data Mining Process Using Clustering : A Survey

Clustering is a basic and useful method in understanding and exploring a data set. Clustering is division of data into groups of similar objects. Each group, called cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. Interest in clustering has increased recently in new areas of applications including data mining, bioinformatics, web mining...

Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering

Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of doc...

Journal:
  • Decision Support Systems

Volume 27

Publication date: 1999